Abstract

What if a successful company starts to receive a torrent of low-valued (one or two
stars) recommendations in its mobile apps from multiple users within a short (say, one
month) period of time? Is it legitimate evidence that the apps have lost quality, or
an intentional plan (via lockstep behavior) to steal market share through defamation?
In the case of a systematic attack on one's reputation, it might not be possible to
manually discern between legitimate and fraudulent interaction within the huge universe
of possible user-product recommendations. Previous works have focused on this
issue, but none of them took into account the context, modeling, and scale that we
consider in this paper. Here, we propose the novel method Online-Recommendation
Fraud ExcLuder (ORFEL) to detect defamation and/or illegitimate promotion of online
products by using vertex-centric asynchronous parallel processing of bipartite
(user-product) graphs. With an innovative algorithm, our results demonstrate both efficacy
and efficiency: over 95% of potential attacks were detected, and ORFEL was at least
two orders of magnitude faster than the state-of-the-art. Built on a novel methodology, our
main contributions are: (1) a new algorithmic solution; (2) a scalable approach; and
(3) a novel context and modeling of the problem, which now addresses both defamation
and illegitimate promotion. Our work deals with relevant issues of the Web 2.0,
potentially augmenting the credibility of online recommendation to prevent losses to
both customers and vendors.

Keywords: graphs, fraud detection, defamation, recommendation, Web 
2.0, data analysis 


1. Introduction 

In the Web 2.0, it is up to the users to provide content, such as photos, text,
recommendations, and many other types of user-generated information. The more interaction
(e.g., likes, recommendations, and comments) a product page or a user profile gets, the
higher the potential profits that a company or an individual may achieve through
automatic recommendation, advertisement, and/or priority in search engines.
In Google Play, for example, mobile apps heavily depend on high-valued (4 or 5 stars)
recommendations to gain importance and to expand their pool of customers; on
Amazon, users are offered the most recommended products, that is, those that were
better rated; and in TripAdvisor, users rely on others' feedback to pick their next
travels. The same mechanism enables defamation, which is the act of lowering the rank of
a product by creating artificial, low-valued recommendations. Sadly, fraudulent interaction
has emerged in the Web 2.0: fraudulent likes, recommendations, and evaluations define
artificial interests that may illegitimately boost the importance of online competitors.

Attackers create illegitimate interaction by means of fake users, credential-stealing
malware, Web robots, and/or social engineering. The identification of such behavior is of
great importance to companies, not only because of the potential losses due to fraud,
but also because their customers tend to consider the reliability of a given website
as an indicator of trustworthiness and quality. According to Facebook, fraudulent
interaction is harmful to all users and to the Internet as a whole, so it is important that
users have true engagement around brands and content.

However, detecting such attacks is a challenging task, especially when there
are millions of users and millions of products being evaluated in a system that deals
with billions of interactions per day. In such attacks, multiple fake users interact with
multiple products at random moments, in such a way that their behavior is camouflaged
among millions of legitimate interactions per second. The core of the problem is: how
to track the temporal evolution of fraudulent user-product activity when the number of
possible interactions is factorial?

We want to identify the so-called lockstep behavior, i.e., groups of users acting
together, generally interacting with the same products at around the same time. As an
example, imagine that an attacker creates a set of fake users to artificially promote his
e-commerce website; he would then comment on and/or recommend his own Web
pages, posts, or advertisements to gain publicity that, fairly, should come from real
customers. Here, an attacker may be an employee of a given company, a professional
(spammer) hired for this specific kind of job, a Web robot, or even an anonymous
user. The weak point in all these possibilities is that the attacker must substantially
interact with the attacked system within limited time windows; also, the attacker must
optimize his efforts by using each fake user account to interact with multiple products.
This behavior agrees with the lockstep definition. Note that this pattern is well defined
in online recommendation and in many other domains, such as academic co-citation,
social network interaction, and search-engine optimization. Since this is not a
new problem, we use in this paper the definition of lockstep behavior given by Beutel
et al. See Section 3.3 for details.

The task of identifying locksteps is commonly modeled as a graph problem - nodes
are either users or products; weighted edges represent recommendations - in which we
want to detect near-bipartite cores considering a given time constraint. The bipartite
cores correspond to groups of users that interacted with groups of products within
limited time intervals. One lockstep may be defamation, when the interactions are negative
(low-valued) recommendations; or illegitimate promotion, when the recommendations
are positive. Therefore, the problem generalizes to finding near-bipartite cores with
edges whose weights correspond to the rank of the recommendations. Note that we
want to tackle the problem without any previous knowledge about suspicious users,
products, or the moments when frauds occurred in the past.

This work extends the state-of-the-art solutions for the problem of lockstep
identification. Our main contributions are threefold:

1. Novel algorithmic paradigm: we introduce the first vertex-centric algorithm
able to spot lockstep behavior in Web-scale graphs using asynchronous parallel
processing; vertex-centric processing is a promising paradigm that still lacks
algorithms specifically tailored to its modus operandi;

2. Scalability and accuracy: we tackle the problem for billion-scale graphs on
one single commodity machine, achieving efficiency that is comparable to that
achieved by state-of-the-art works on large clusters of computers, whilst obtaining
the same efficacy;

3. Generality of scope: we tackle the problem for real weighted graphs ranging
from social networks to e-commerce recommendation, expanding the state-of-the-art
of lockstep semantics to discriminate defamation and illegitimate promotion.

This paper follows a traditional organization. Section 2 presents background
concepts, while Section 3 reviews the related works. Our proposal is described in
Section 4. In Section 5, we report experimental results, including real data analyses.
Finally, the last section concludes the paper and presents ideas for future work.

2. Background 

2.1. Vertex-centric graph processing

We use in this paper the well-known concept of vertex-centric processing.
Given a graph G = (V, E) with vertices labeled from 1 to |V|, we associate a value with
each vertex and with each edge; for a given edge e = (u, v), u is the source and v is the
target. With values associated with vertices and edges, vertex-centric processing
corresponds to the graph scan approach depicted in Algorithm 1. The values are determined
according to the computation that is desired, e.g., Pagerank or belief propagation; we
illustrate this fact with hypothetical functions f and g in the algorithm. Evidently, a
single scan is not enough for most useful computations; therefore, the graph is commonly
scanned many times until a criterion of convergence is satisfied. Graph processing,
then, becomes what is defined in Algorithm 2.

The vertex-centric processing paradigm contrasts with usual graph traversal
algorithms, like breadth-first or depth-first searches. While traversal-based algorithms
support any kind of graph processing, they are designed to work with the entire graph in
main memory; otherwise, they become prohibitively costly due to repeated random disk
accesses. Vertex-centric processing, on the other hand, is limited to problems that
can be solved over the direct neighbors of the vertices (or with clever adaptations to
such constraint); its advantage is that it is well suited to disk-based processing, since
it can rely on sequential disk accesses. This kind of processing lends itself not only
to disk-based processing, but also to parallel processing, in which
each thread is responsible for a different share of the vertices. This possibility
yields quite effective algorithms.

Algorithm 1 Vertex-centric graph processing

procedure Graph_scan(Graph G)
    for i = 1 to |V| do
        set_e ← set of edges adjacent to V[i]
        V[i].value ← f(set_e)
        for each edge e in set_e do
            e.value ← g(V[i].value, e.value)


Algorithm 2 Graph processing

procedure Graph_processing
    while convergence criterion is not satisfied do
        Graph_scan(G)
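
To make Algorithms 1 and 2 concrete, the following is a minimal single-threaded Python sketch. The dictionary-based graph representation, the "no value changed" convergence criterion, the max_scans safeguard, and all function names are assumptions made for illustration only; frameworks such as GraphChi organize the same loop around disk shards and threads.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]  # (source, target)


def graph_scan(vertex_values: Dict, edge_values: Dict[Edge, object],
               adjacency: Dict[Hashable, List[Edge]],
               f: Callable, g: Callable) -> bool:
    """One scan in the spirit of Algorithm 1; returns True if any value changed."""
    changed = False
    for v in vertex_values:
        edges = adjacency.get(v, [])
        if not edges:
            continue  # assumption: isolated vertices keep their value
        # V[i].value <- f(set_e)
        new_value = f([edge_values[e] for e in edges])
        if new_value != vertex_values[v]:
            vertex_values[v] = new_value
            changed = True
        # e.value <- g(V[i].value, e.value)
        for e in edges:
            new_edge = g(vertex_values[v], edge_values[e])
            if new_edge != edge_values[e]:
                edge_values[e] = new_edge
                changed = True
    return changed


def graph_processing(vertex_values, edge_values, adjacency, f, g,
                     max_scans: int = 100) -> int:
    """Algorithm 2: rescan until nothing changes; returns the number of scans."""
    scans = 0
    while scans < max_scans:
        scans += 1
        if not graph_scan(vertex_values, edge_values, adjacency, f, g):
            break
    return scans
```

With f = min and g = min, and with each edge initialized to the smaller identifier of its endpoints, this sketch propagates the minimum label inside each connected component, which is one of the classical computations cited for this paradigm.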

2.2. Asynchronous parallel processing 

Many researchers have developed systems to process graphs at large scale,
either using vertex-centric or edge-centric processing; this is the case of systems Pregel
[23], Pegasus, PowerGraph, and GraphLab. However, such systems
are parallel-distributed, and thus, they demand knowledge, availability, and
management of costly clusters of computers. More recently, a novel paradigm emerged in the
form of frameworks that rely on asynchronous parallel processing, including systems
GraphChi, TurboGraph, X-Stream, and MMap. Such systems use
disk I/O optimizations and the neighborhood information of nodes/edges in order to set
up algorithms that can work in asynchronous parallel mode; that is, their threads are
not required to advance synchronously along the graph in order to reach useful
computation. This approach has successfully tackled many problems, such as
Pagerank, connected components, shortest path, and belief propagation, to name a few.
In this paper, we use vertex-centric graph processing over framework GraphChi;
however, our algorithm can be adapted to any of the frameworks available in the literature.

3. Related works 

3.1. Clustering 

The identification of lockstep behavior refers to the problem of partitioning both
the rows and the columns of a matrix - known in the literature as co-clustering or
bi-clustering. Some authors have worked on similar variations of the bi-clustering
problem. For example, Papalexakis and Sidiropoulos used PARAFAC decomposition over
the ENRON e-mail corpus [29]; Dhillon et al. used information theory over
word-document matrices; and Banerjee et al. used Bregman divergence for predicting
missing values and for compression. Other applications include gene-microarray
analysis, intrusion detection, natural language processing [35], collaborative
filtering, and image, speech and video analysis. Note that, in this problem
setting, whenever time is considered in the form of time windows to be detected, we
have an NP-hard (non-deterministic polynomial-time hard) problem, which makes it
impractical to identify the best solution even for small datasets. In fact, we deal with
issues fundamentally different from the problems proposed so far, which, according to
Kriegel et al., are not directly comparable due to their specificities.

Theoretically, our work resembles the works of Gupta and Ghosh and of Crammer
and Chechik; similarly, we use local clustering principles, but, differently, we
are not dealing with one-class problems. Besides, the core of our technique is a variant
of mean-shift clustering, now considering temporal and multi-dimensional aspects.
As we mentioned before, our contribution relates not only to performance, but also to
a novel algorithmic approach.

3.2. Detection of suspicious behavior on the Web 

One of the first algorithms tailored to detect suspicious behavior on the Web was
designed by Douceur in 2002. The author coined the term sybil attack in the
specific context of peer-to-peer networks. Sybil attacks are attacks in which a single
entity can provide multiple identities, that is, a single node in the network can create
or steal several other identities and use them to gain advantages, thus undermining the
security of the whole system. Later, Newsome et al. [25] showed that sybil attacks can
also occur in sensor networks where the attacker wants to bypass security measures,
such as voting mechanisms and resource allocation policies.

One similar type of attack was studied by Chirita et al. - the shilling attack;
in shilling attacks, fake profiles are used to rate items in a recommendation system.
Chirita et al. proposed a technique to analyze profiles and to determine whether or
not they are suspicious. Later, Su et al. developed an algorithm to detect groups of
shilling attacks, in which several profiles act in conjunction to alter the ratings of items
in the system.

While both sybil and shilling attacks are similar to the concepts that we address
in this paper, namely defamation and illegitimate promotion, none of the aforementioned
works considers the temporal dimension to detect the attacks. Time, in such a setting,
leads to a different problem with NP-hard complexity. Also, none of these works
took into account the performance and scale that we consider in this paper.

Other related works use graph theory to detect suspicious behavior on the Web.
This is the case of algorithm Crochet, which aims at identifying quasi-cliques based
on an innovative heuristic; it is also the case of MultiAspectForensics [24], which uses
tensor decomposition to detect patterns within communities, including bipartite cores.
In another work, Eigenspokes uses singular-value decomposition to detect
unexpected patterns in phone call data; also, Netprobe uses belief propagation to find
near-bipartite cores in e-commerce graphs. Note, however, that, in spite of the many
qualities of these related works, none of them focuses on performance at the same scale
that we do; furthermore, they do not study the same problem that we do here, that is,
the detection of a set of users fraudulently interacting with the same set of products
at around the same time. In fact, the closest approach to our work is the CopyCatch
algorithm, which focuses on the unweighted version of the problem in a parallel,
distributed setting - its reported experimental results used one thousand machines. In
our work, we introduce a vertex-centric asynchronous parallel algorithm that runs on
one single commodity machine, whose performance rivals that reported in this former
work, while still achieving similar accuracy rates.

3.3. Lockstep formulation 

In this section, we provide a mathematical description of the lockstep detection
problem. As mentioned in the introductory Section 1, we generalize the problem
by broadening its scope to defamation and illegitimate promotion. Before doing
so, we present the original formulation of the concept of a lockstep,
which was formally defined by Beutel et al. as a temporally-coherent near bipartite
core. Along this section, please refer to Table 1 for a list of symbols and definitions.

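Definition 1. A set of products P and a set of users U comprise an [n, m, Δt, ρ]-
temporally-coherent near bipartite core if and only if there exists P_i ⊆ P for all i ∈ U
such that:

\begin{align}
|P| &\geq m \tag{1} \\
|U| &\geq n \tag{2} \\
|P_i| &\geq \rho\,|P| \quad \forall i \in U \tag{3} \\
(i,j) &\in E \quad \forall i \in U,\ j \in P_i \tag{4} \\
\exists\, t_j \in \mathbb{R} \ \text{s.t.}\ |t_j - L_{i,j}| &\leq \Delta t \quad \forall i \in U,\ j \in P_i \tag{5}
\end{align}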


In other words, we have a lockstep if we find a set of products P that was
recommended by a set of users U within a Δt time window; we relax this definition with
parameter ρ, which states that we also have a lockstep if we partially (ρ percentage)
satisfy this definition. Note that what makes the problem even more challenging is the
temporal factor; also, note that the problem refers to reducing the search space of frauds
by pointing out suspicious behaviors, which can turn out to be actually fraudulent, or
not. Figure 1 illustrates the concept of a lockstep. It shows how bipartite cores are
formed and it also highlights the independence of the time-windows that are exclusive
to each product.
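
As an illustration of Definition 1, the sketch below checks whether a candidate pair (U, P) satisfies conditions (1) to (5) for a given choice of time centers t_j. The function name, the dictionary layout of the timestamps, and the fact that the centers are supplied by the caller are assumptions of this example; the definition itself only requires that suitable centers exist, and searching for them is the hard part of the problem.

```python
from typing import Dict, Hashable, Set, Tuple


def is_lockstep(users: Set[Hashable], products: Set[Hashable],
                timestamps: Dict[Tuple[Hashable, Hashable], float],
                centers: Dict[Hashable, float],
                n: int, m: int, delta_t: float, rho: float) -> bool:
    """Check Definition 1 for fixed (hypothetical) time centers t_j.

    timestamps maps an existing edge (user, product) to its time L[i, j];
    a missing key means that the edge is not in E.
    """
    if len(products) < m or len(users) < n:      # conditions (1) and (2)
        return False
    for i in users:
        # Largest admissible P_i for these centers: conditions (4) and (5).
        p_i = {j for j in products
               if (i, j) in timestamps
               and abs(centers[j] - timestamps[(i, j)]) <= delta_t}
        if len(p_i) < rho * len(products):       # condition (3)
            return False
    return True
```

For example, three users that recommended products A and B around times 11 and 50, respectively, form a core for Δt = 2 and ρ = 1; if one of them recommended B much later, the pair only qualifies for a smaller ρ.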

While we adopt the aforementioned definition of suspiciousness, it is also
important to discuss how effective it is in preventing malicious agents from
manipulating recommendations. In other words, we are interested in finding how much damage
agents could inflict without being detected. The core fact in the definition of an attack
is that the smaller the attack, the smaller its harm; in consequence, while it is hard to
detect very small attacks, they tend to have no use unless they occur in extremely high
cardinality. The boundaries of this relation for a non-temporal version of the problem
are an open problem, as discussed in previous works. The challenge
becomes even harder when time is considered, as we do in this paper, which sets up an
extension of the problem.

Figure 1: Lockstep illustration: A group of users (1, 2 and 3) recommends a group of
products (A and B), within limited time-windows for each product, forming a bipartite
core.


In this context, it is possible to conclude that, although we may miss very small
attacks, and trying to detect them might raise the number of false positives, an adversary
can do only a limited amount of damage without getting caught under the controlled
conditions. Finding an optimal strategy of attack is also related to the same open
problem discussed before. Moreover, since we consider individual temporal centers for each
lockstep, one cannot merely use common strategies, such as waiting for a given amount
of time before attacking other products with the same users, as he/she would be caught
anyway. Additionally, the fine tuning of our algorithm's parameters allows the user to
detect different types of attacks, further diminishing the possibilities of an adversary
bypassing the system.


4. Methodology 


4.1. The generalized lockstep problem 

This section shows how to enhance the potential semantics of the lockstep-detection
problem by taking into account the weights of edges (e.g., recommendations' scores).

Given the formulation presented in Section 3.3, we propose new semantics for the
problem by considering the weights of edges to define the concepts of defamation
(Equation 6) and illegitimate promotion (Equation 7). These weights correspond, for
instance, to the numeric evaluation (score) given by a user to a product in a
recommendation website. Our formulation considers the weights to be positive integers, and we
use a threshold K to distinguish between defamation and promotion.


\begin{align}
W_{i,j} &< K, \quad i \in U,\ j \in P_i \tag{6} \\
W_{i,j} &> K, \quad i \in U,\ j \in P_i \tag{7}
\end{align}

We consider the problem as an optimization problem, whose objective is to catch 
as many suspect users as possible, while only growing P until parameter m is satisfied. 


Symbol          Definition
--------------  -----------------------------------------------------------------
M and N         Number of nodes in each side of the bipartite graph.
C               Set of locksteps.
I               M × N adjacency matrix.
L               M × N matrix holding the timestamp of each edge.
W               M × N matrix holding the weight of each edge.
U[c] and P[c]   Set of users or products in lockstep c.
m and n         Minimum number of products and users in the lockstep to be
                considered valid.
Δt              Size of the timespan.
ρ               Threshold percentage that the cardinality of the sets of products
                and users must satisfy to be in a lockstep.
nSeeds          Number of starting seeds for the algorithm to begin searching
                for locksteps.
λ and K         Function and threshold used to define defamation and promotion.
v_j             Current average time of suspicious recommendations to product j.

Table 1: Symbols and Definitions.


Our objective function is in Equation 8. The goal is to find U[c] and P[c] to maximize
the number of users and their interactions for a given cluster c.


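\begin{equation}
\underset{U[c],\,P[c]}{\arg\max} \; \sum_{i} q(L_{i,*}, W_{i,*}, P[c]) \tag{8}
\end{equation}

where

\begin{equation}
q(u, w, P[c]) =
\begin{cases}
\sigma & \text{if } \sigma = \sum_{j \in P[c]} I_{ij}\, \phi(v_j, u_j)\, \lambda(w_j) \geq \rho m \\
0 & \text{otherwise}
\end{cases} \tag{9}
\end{equation}

\begin{equation}
\phi(t_v, t_u) =
\begin{cases}
1 & \text{if } |t_v - t_u| \leq \Delta t \\
0 & \text{otherwise}
\end{cases} \tag{10}
\end{equation}

\begin{equation}
\lambda(w_j) =
\begin{cases}
1 & \text{if } w_j > K \\
0 & \text{otherwise}
\end{cases}
\quad \text{for promotion} \tag{11}
\end{equation}

\begin{equation}
\lambda(w_j) =
\begin{cases}
1 & \text{if } w_j < K \\
0 & \text{otherwise}
\end{cases}
\quad \text{for defamation} \tag{12}
\end{equation}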


Equations 11 and 12 refer to our definitions of illegitimate promotion and
defamation, respectively, while Equation 9 shows how we incorporate these weight constraints
in the original problem, through the definition of a threshold function λ. That is, we
expand the formulation by including new information relative to the weight of these
relationships, and we incorporate such definitions in the objective function, effectively
broadening the scope of the problem and its potential applications.
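
The objective function can be read operationally as in the Python sketch below. It is only an illustration under assumptions: the nested-dictionary layout (times[user][product], weights[user][product]), the function names, and the mode strings are hypothetical, and the strict comparisons in lam follow the weight constraints as written in Equations 6 and 7.

```python
from typing import Dict, Hashable, Set


def lam(weight: float, k: float, mode: str) -> int:
    """Threshold function: 1 when the weight fits the chosen attack type."""
    if mode == "promotion":
        return 1 if weight > k else 0
    if mode == "defamation":
        return 1 if weight < k else 0
    raise ValueError("mode must be 'promotion' or 'defamation'")


def phi(t_center: float, t_user: float, delta_t: float) -> int:
    """Temporal indicator: 1 when the recommendation is within delta_t of the center."""
    return 1 if abs(t_center - t_user) <= delta_t else 0


def q(user_times: Dict[Hashable, float], user_weights: Dict[Hashable, float],
      products: Set[Hashable], centers: Dict[Hashable, float],
      delta_t: float, rho: float, m: int, k: float, mode: str) -> int:
    """Per-user score: sigma if the user covers at least rho * m products, else 0."""
    sigma = sum(phi(centers[j], user_times[j], delta_t) * lam(user_weights[j], k, mode)
                for j in products
                if j in user_times)  # plays the role of the adjacency term I_ij
    return sigma if sigma >= rho * m else 0


def objective(times, weights, products, centers, delta_t, rho, m, k, mode) -> int:
    """Sum of the per-user scores over all users of the dataset."""
    return sum(q(times[i], weights[i], products, centers, delta_t, rho, m, k, mode)
               for i in times)
```

Switching mode between "defamation" and "promotion" is the only change needed to move from one semantics to the other, which is the role played by the threshold function in the formulation.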

4.2. Algorithm ORFEL 

In order to find locksteps, this section presents the Online-Recommendation
Fraud ExcLuder (ORFEL), a novel, iterative algorithm that leverages the idea of
vertex-centric processing introduced in Section 2.1 to expand and improve both the
scope (weighted graphs) and the efficiency (scalability on a single computer) of current
state-of-the-art approaches. Each iteration of our algorithm executes two functions:
updateProducts and updateUsers that, respectively, add/remove products and
users to/from a lockstep that is being identified. The algorithm iterates until convergence
- that is, until sets P and U stabilize for all the locksteps that were found; we consider
that a lockstep is stable when no product or user enters or leaves the lockstep in
one iteration, compared to the previous one. The full pseudo-code of ORFEL is in
Algorithm 3.


Initialization 

The algorithm relies on seeds to search the data space; the general idea is to have each
seed inspecting its surroundings looking for one local maximum. Each seed in the
algorithm corresponds to one potential lockstep, which comprises a set of products and
a set of users. The initial seeds correspond to minimum locksteps, that is, locksteps
with one single product, and a few (≥ 1) users each - the only requirement for the
initial set of users is that it cannot be an empty set.

The initialization step randomly chooses products of the dataset, each one
corresponding to one seed. Then, for each product (seed), ORFEL forms initial locksteps by
randomly choosing a constant, small number of users that recommended this product.
This is necessary so that the algorithm has initial elements - P[i] and U[i] for every
i-th lockstep - scattered throughout the search space. Later, the initial locksteps will
grow iteratively in number of products and users.
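
A minimal sketch of this seeding step is given next. The input layout (a dictionary from each product to the set of users that recommended it), the users_per_seed constant, and the choice of sampling products without replacement are assumptions of this example and are not prescribed by the description above.

```python
import random
from typing import Dict, Hashable, List, Optional, Set


def initialize_seeds(recommenders: Dict[Hashable, Set[Hashable]],
                     n_seeds: int, users_per_seed: int = 2,
                     rng: Optional[random.Random] = None) -> List[dict]:
    """Build minimal locksteps: one random product plus a few of its recommenders."""
    rng = rng or random.Random()
    # Only products with at least one recommender can yield a non-empty user set.
    candidates = [p for p, us in recommenders.items() if us]
    chosen = rng.sample(candidates, min(n_seeds, len(candidates)))
    seeds = []
    for product in chosen:
        pool = sorted(recommenders[product])
        k = min(users_per_seed, len(pool))
        seeds.append({"products": {product}, "users": set(rng.sample(pool, k))})
    return seeds
```

Passing a seeded random.Random instance makes the seeding reproducible, which is convenient when comparing runs with different values of nSeeds.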


Product update 

In procedure updateProducts - see Algorithm 4 - we only consider vertices that are
products, so modifications occur in set P[i] only. This function is called for every
product to test if it fits in one of the locksteps; the test is performed for all locksteps.
One product enters a given lockstep if at least ρ percent of the users currently in the
lockstep recommended that product within a Δt time window. To compute this
percentage, the algorithm only considers recommendations that fit the given weight constraint
(represented by the λ function), which characterizes either defamation or illegitimate
promotion.

For locksteps with m products, that is, those with the maximum number of
products, we test if it is worth swapping one of its products for the candidate one. This
test is similar to the one used to add a product, except that, to be swapped in, the
candidate product must contain a superset of the set of recommendations that the
current product has. This is a heuristic approach that leads to an additional coverage of
the search space because, as we look for supersets of recommendations, the locksteps
tend to increase in size.
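
The add-or-swap rule can be sketched as follows for a single lockstep. This is an illustrative reading, not the reference implementation: the edge dictionary {(user, product): (time, weight)}, the helper supporters, the accepts callable standing in for λ, and the use of a proper-subset test to decide a swap are all assumptions of the example.

```python
from typing import Callable, Dict, Hashable, Set, Tuple

Edges = Dict[Tuple[Hashable, Hashable], Tuple[float, float]]  # (time, weight)


def supporters(product, users: Set[Hashable], edges: Edges,
               delta_t: float, accepts: Callable[[float], bool]) -> Set[Hashable]:
    """Lockstep users whose recommendation of `product` is near the average
    recommendation time and fits the weight constraint."""
    hits = {u: edges[(u, product)] for u in users if (u, product) in edges}
    if not hits:
        return set()
    center = sum(t for t, _ in hits.values()) / len(hits)
    return {u for u, (t, w) in hits.items()
            if abs(t - center) <= delta_t and accepts(w)}


def update_product(product, lockstep: dict, edges: Edges, m: int, rho: float,
                   delta_t: float, accepts: Callable[[float], bool]) -> bool:
    """Try to add `product` to the lockstep, or swap it in; True if it changed."""
    if product in lockstep["products"] or not lockstep["users"]:
        return False
    recomms = supporters(product, lockstep["users"], edges, delta_t, accepts)
    if len(lockstep["products"]) < m:
        if len(recomms) / len(lockstep["users"]) >= rho:
            lockstep["products"].add(product)
            return True
        return False
    # Lockstep is full: swap only if the candidate strictly covers a current product.
    for current in sorted(lockstep["products"]):
        if supporters(current, lockstep["users"], edges, delta_t, accepts) < recomms:
            lockstep["products"].remove(current)
            lockstep["products"].add(product)
            return True
    return False
```

Requiring a proper subset in the swap test is a design choice of this sketch: it guarantees that every swap strictly increases the covered recommendations, so two equivalent products cannot keep replacing each other.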


User update 

Procedure updateUsers - see Algorithm 5 - considers only vertices that are users, so
it modifies set U[i] only. Similarly to what is done in step updateProducts, we update
each lockstep separately by testing if the current user can be added to it. A candidate
user will enter a lockstep if it recommends at least ρ percent of the products in the
cluster within a 2Δt time window of each of the products' time centers - that is, the
average recommendation time on that product inside the lockstep - and if it fits the
desired weight constraint. If the candidate user fulfills the requirements, it is added to
that lockstep. Note that this step allows users outside the actual Δt time window to
enter the lockstep; this is the mechanism that drives the lockstep towards a better local
maximum, whenever there exists one. We propose to use 2Δt, following empirical
evidence obtained from both our work and the state-of-the-art approach of Beutel et al.
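
The user test can be sketched in the same style. As before, the edge dictionary {(user, product): (time, weight)}, the accepts callable standing in for λ, and the function name are assumptions of this example; the relaxed 2Δt window is the one described in the text.

```python
from typing import Callable, Dict, Hashable, Tuple

Edges = Dict[Tuple[Hashable, Hashable], Tuple[float, float]]  # (time, weight)


def update_user(user, lockstep: dict, edges: Edges, rho: float,
                delta_t: float, accepts: Callable[[float], bool]) -> bool:
    """Add `user` to the lockstep if it covers at least rho of its products
    within 2 * delta_t of each product's time center; True if it was added."""
    if user in lockstep["users"] or not lockstep["products"]:
        return False
    kept = 0
    for product in lockstep["products"]:
        if (user, product) not in edges:
            continue
        # Time center: average recommendation time of current lockstep users.
        times = [edges[(u, product)][0] for u in lockstep["users"]
                 if (u, product) in edges]
        if not times:
            continue
        center = sum(times) / len(times)
        t, w = edges[(user, product)]
        if abs(t - center) <= 2 * delta_t and accepts(w):
            kept += 1
    if kept / len(lockstep["products"]) >= rho:
        lockstep["users"].add(user)
        return True
    return False
```

Users admitted through the relaxed window are not final members: the end-of-iteration step described next is what trims each lockstep back to the best Δt window.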

End iteration 

As can be seen in Algorithm 3, we run procedure endIteration right after step
updateUsers is complete. This additional step is described in Algorithm 6. For all
locksteps, the algorithm sorts the 2Δt recommendations by their timestamps and scans
them sequentially looking for the subset that maximizes the recommendation criterion
(number of recommendations); the target subset must fit a Δt window. This is the core
mechanism of our algorithm: it lets a 2Δt time window take place
at first and then, from the corresponding set of recommendations, it selects a subset that
maximizes the target criterion. This mechanism is what makes the seeds "inspect" their
2Δt neighborhoods. If the recommendation set changes, a new iteration will lead new
products and users to enter or be swapped into the lockstep, until convergence. Once a
seed finds a local maximum, it stops evolving and does not change anymore.

Note that some seeds may converge sooner than others, leading to locksteps smaller
than parameters m and n. These seeds are considered "dead" (no modifications between
iterations), so they are ignored by the algorithm. They can occur from the second
iteration on, after which the number of locksteps (live seeds) becomes smaller than the
initial number of seeds. The algorithm converges when all seeds are "dead".
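
The window selection performed at the end of each iteration amounts to a classical sliding-window scan over sorted timestamps. The sketch below handles the recommendations of one product; the (user, time) pair layout, the function name, and the interpretation of the window as a total span of `width` time units are assumptions of this example.

```python
from typing import Hashable, List, Set, Tuple


def best_window(recomms: List[Tuple[Hashable, float]], width: float) -> Set[Hashable]:
    """Return the users of the densest group of recommendations whose
    timestamps span at most `width` (the first such group in case of ties)."""
    ordered = sorted(recomms, key=lambda r: r[1])
    best_lo, best_hi = 0, 0          # half-open slice [best_lo, best_hi)
    lo = 0
    for hi in range(len(ordered)):
        # Shrink from the left until the window fits the allowed span.
        while ordered[hi][1] - ordered[lo][1] > width:
            lo += 1
        if hi + 1 - lo > best_hi - best_lo:
            best_lo, best_hi = lo, hi + 1
    return {user for user, _ in ordered[best_lo:best_hi]}
```

After sorting, the scan is linear in the number of recommendations, since each of the two indices only moves forward; users outside the returned set are the ones that would be removed from the lockstep.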


Algorithm 3 ORFEL Algorithm

function ORFEL(n, m, ρ, Δt, nSeeds)
    Initialize U[nSeeds], P[nSeeds]        ▷ Initial seeding
    repeat
        U' ← U
        P' ← P
        for each product p in V do
            P ← updateProducts(p)
        for each user u in V do
            U ← updateUsers(u)
        endIteration()
    until U' = U and P' = P
    return [U, P]
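
The outer loop of Algorithm 3 can be sketched independently of how the three steps are implemented. In the Python sketch below, the three steps are passed in as callables; this interface, the max_iter safeguard, and the deep-copy comparison used as the convergence test are assumptions of this example.

```python
import copy
from typing import Callable, Iterable, List


def orfel_loop(locksteps: List[dict], products: Iterable, users: Iterable,
               update_product: Callable, update_user: Callable,
               end_iteration: Callable, max_iter: int = 1000) -> int:
    """Repeat the three steps until no lockstep changes; returns iterations used."""
    products, users = list(products), list(users)
    for iteration in range(1, max_iter + 1):
        before = copy.deepcopy(locksteps)      # U' and P' of Algorithm 3
        for p in products:
            update_product(p, locksteps)
        for u in users:
            update_user(u, locksteps)
        end_iteration(locksteps)
        if locksteps == before:                # U' = U and P' = P
            return iteration
    return max_iter
```

Comparing full snapshots is the simplest way to express the stopping rule; an implementation concerned with memory would instead track a per-lockstep "changed" flag.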


Algorithm 4 updateProducts

procedure updateProducts(vertex)
    for each Lockstep c ∈ C do
        Recomms ← U[c].edges ∩ vertex.edges
        timeCenter ← avgtime(Recomms)
        for each edge e in Recomms do
            if |e.time - timeCenter| > Δt or not λ(e.weight) then
                Recomms ← Recomms - {e}
        if |P[c]| < m then
            if (|Recomms| / |U[c]|) ≥ ρ then
                P[c] ← P[c] ∪ {vertex}
        else
            for each product p ∈ P[c] do
                if p.Recomms ⊂ Recomms then
                    swap ← p
            if swap is defined then
                P[c] ← (P[c] - {swap}) ∪ {vertex}


Algorithm 5 updateUsers

procedure updateUsers(vertex)
    for each Lockstep c ∈ C do
        Recomms ← P[c].edges ∩ vertex.edges
        for each edge e in Recomms do
            pCenter ← avgtime((u, e.vertex), u ∈ U[c])
            if |e.time - pCenter| > 2Δt or not λ(e.weight) then
                Recomms ← Recomms - {e}
        if (|Recomms| / |P[c]|) ≥ ρ then
            U[c] ← U[c] ∪ {vertex}


Algorithm 6 endIteration

procedure endIteration
    for each Lockstep c ∈ C do
        for each product p ∈ P[c] do
            Sort U[c] by the time of the Recomms
            Scan the sorted 2Δt Recomms of U[c] for the subset that fits a Δt window
                and maximizes the number of Recomms
            Remove the users from U[c] that are not in the subset


4.3. Discussion about the parameters

As can be seen in Algorithm 3, ORFEL has five parameters: m, n, ρ, Δt and nSeeds.
The first two, m and n, respectively refer to the cardinality of products and users that
the algorithm verifies when evaluating suspicious locksteps. Parameter ρ is the
minimum percentage (the tolerance fraction) of products, ρ * m, for the algorithm to state
that a bipartition is, in fact, suspicious. Although one can freely alter the value of ρ,
based on empirical evidence, we suggest using no less than 80%; otherwise, the locksteps
might degenerate. Note that we use a single value of ρ for both users and products,
because, intuitively, this parameter is expected to be nearly the same for the two
entities; nevertheless, algorithmically, one could easily adapt our proposal to use different
values at the cost of greater computational complexity. Parameter Δt defines the time
window within which the interactions (recommendations) should take place. Finally,
parameter nSeeds refers to the number of seeds that the algorithm will spread through
the search space, each one looking for one suspicious lockstep.

Parameters m and n define the aspects of the suspicious behaviors that we are
looking for. Increasing (or decreasing) the value of m or n means that we want to find
suspicious behaviors involving more (or fewer) products and/or users. These values
define what we call "AttackSize", i.e., the dimensions of the attacks that we presume
to exist. In practice, parameters m and n filter out attacks that are too small and/or too
large, which may be desired depending on the domain. In the experimental Section 5, we
evaluate how distinct configurations of AttackSize impact the efficacy of our algorithm.

Parameter ρ makes the algorithm flexible about different types of attacks, including
those in which the users attack only a fraction (percentage) of the expected number of
products m in the locksteps. The value of ρ defines how tolerant we want to be with
respect to the very definition of suspiciousness. If we set ρ to 1, only perfect, full
locksteps would be considered, in which every user recommended every product of
the cluster. On the other hand, if we set ρ too low, such as ρ = 0.5 for instance,
only half of the users would have to recommend each product, possibly leading to
incorrect assumptions about the concept of suspiciousness. In practice, ρ defines the
tolerance that the algorithm should have around m and n. Note that parameters ρ, m and
n depend on the semantics of the problem's domain and ought to be different for each
application. The same is true for parameter Δt, which we describe as follows.

Parameter Δt is the time span to be defined by the analyst when searching for
attacks. For instance, let us assume the context of attacks in a social network; in this
setting, one could argue that a time span as large as a couple of hours is enough to find
ill-intended interactions. On the other hand, in the context of online reviews, the time
span of one week could be more appropriate. Note that the same reasoning can also be
used to define parameters m and n.

Finally, parameter nSeeds controls ORFEL's potential of discovery; as we show in
the experiments (see Figure 3), the minimum number of seeds required to analyze a
given dataset grows with the data size (approximately linearly in the logarithm of the
number of edges).
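
To make the role of these parameters concrete, the following is a minimal Python sketch, not ORFEL's actual implementation. The class `LockstepParams`, its field names, and the helper `satisfies_tolerance` are hypothetical names introduced for illustration. The check assumes one reading of p, namely that each product of a candidate lockstep must be recommended by at least a fraction p of its users, and it treats m and n as lower bounds only.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class LockstepParams:
    """Hypothetical container for the parameters discussed above."""
    m: int          # expected number of products (used here as a lower bound)
    n: int          # expected number of users (used here as a lower bound)
    p: float        # tolerance in (0, 1]; 1.0 accepts only perfect locksteps
    delta_t: float  # time span of an attack, in the dataset's time unit
    n_seeds: int    # number of seeds used to search for locksteps


def satisfies_tolerance(users: Set[str],
                        recommenders: Dict[str, Set[str]],
                        params: LockstepParams) -> bool:
    """Check the size and tolerance of a candidate lockstep.

    `recommenders` maps each candidate product to the set of users that
    recommended it inside the time window (assumed to be precomputed).
    """
    if len(users) < params.n or len(recommenders) < params.m:
        return False
    needed = params.p * len(users)  # users required per product
    return all(len(users & who) >= needed for who in recommenders.values())
```

With p = 1.0 the check accepts only groups in which every user recommended every product; lowering p relaxes that requirement proportionally.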

4.4. Convergence 

Our algorithm finds a set of local maxima for the objective function defined
earlier. Note that this function is bounded, since the sets of users and products
are finite. Therefore, convergence depends solely on the behavior of steps
updateProducts, updateUsers and enditeration.

In step updateProducts, the algorithm checks if a given product should be added
to any of the current locksteps, deciding to include or to swap that product only if it
covers more recommendations than what we have so far. As a result, the objective
function may only improve or stay unaltered after this step. Step updateUsers, in turn,
attempts to add suspect users to the existing locksteps by extending the
size of the time window, while step enditeration makes sure that only the largest set of
users fitting in the best Δt time window is added to each lockstep. As a consequence,
these last two steps can only improve our objective function, by including more users,
or leave it unaltered if no users are added.

These observations lead us to conclude that the locksteps grow monotonically in
our algorithm, eventually reaching a local maximum that prevents changes between two
iterations. Therefore, ORFEL always converges regardless of the data given as input.
Besides this theoretical exercise, the convergence of our algorithm is empirically
demonstrated in Section 5.
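
The argument can be condensed into a short derivation. This is only a sketch under stated assumptions: the symbol $f$ for the objective, $L^{(i)}$ for the locksteps after iteration $i$, and $f_{\max}$ for the bound are notation introduced here, and $f$ is assumed to be integer-valued because it counts covered recommendations.

```latex
f\big(L^{(0)}\big) \le f\big(L^{(1)}\big) \le \dots \le f\big(L^{(i)}\big) \le f_{\max} < \infty
\quad\Longrightarrow\quad
\exists\, i^{*} :\; f\big(L^{(i^{*}+1)}\big) = f\big(L^{(i^{*})}\big)
```

A non-decreasing integer sequence that is bounded above can increase strictly only a finite number of times, so some iteration $i^{*}$ must leave the objective unchanged, which is the stopping condition described above.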

4.5. Computational cost 

To study the computational cost of our algorithm, let us assume that the graph G
received as input has size D bytes, that we have M bytes available in main memory, and
that the disk blocks have b bytes each. In this setting, ORFEL splits the graph into
$P = \lceil D/M \rceil$ parts. Each part contains edges that are sorted on disk according
to their source vertices, so that the graph is processed by reading the parts twice, first
as targets and then as sources. Therefore, in order to read the entire graph, it is
necessary to read $B = D/b$ disk blocks twice, or $2B$ times. For each part it is also
necessary to read the other $P - 1$ parts, leading to $P^2$ disk seeks. Therefore, the cost
of disk operations is given by $P^2$ disk seeks + $2B$ block reads per iteration.

ORFEL runs for I iterations. In each iteration, besides the disk operations, it runs
once for each of the S seeds (worst case), processing in memory all the $|E|$ edges of the
graph each time. Therefore, the processing cost of the algorithm is $I \cdot O(S \cdot |E|)$.

Each iteration of the algorithm requires a reorganization step in which the locksteps
of each seed are redefined based on the results annotated in the last iteration. For I
iterations, S seeds, n users and m products, this step runs at cost
$I \cdot O(S \cdot n \cdot m \log m)$. Part of this cost is due to sorting in memory (the
logarithmic factor). This is the worst-case scenario, in which the algorithm processes
all seeds - the cost drops abruptly after a few iterations because the majority of the
seeds do not grow; instead, they stop evolving at a local maximum that is too small to be
considered a lockstep, and are ignored in further iterations.

Finally, the total cost of ORFEL is
$I \cdot (P^2 \text{ disk seeks} + 2B \text{ block reads} + O(S \cdot |E|) + O(S \cdot n \cdot m \log m))$.
Note that the cost of processing is negligible, since it
is 6 orders of magnitude smaller than that of a mechanical disk and 4 orders smaller
than that of a solid-state disk. Thus, the main cost of ORFEL is
$I \cdot (P^2 \text{ disk seeks} + 2B \text{ block reads})$. Note, from our analysis, that the
computational cost depends on the amount of main memory available, which is used as a
buffer for data coming from disk; hence, all the runtime measurements reported in the
next section could be smaller if we had more memory to use.
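
The disk terms of this analysis are easy to evaluate numerically. The sketch below does so for hypothetical values of D, M and b; the function name `disk_cost_per_iteration` and the example sizes are assumptions for illustration only, and B is rounded up to a whole number of blocks.

```python
def disk_cost_per_iteration(graph_bytes: int, memory_bytes: int,
                            block_bytes: int) -> dict:
    """Evaluate the per-iteration disk terms of the cost analysis."""
    # P = ceil(D / M): number of parts that fit in main memory one at a time.
    parts = (graph_bytes + memory_bytes - 1) // memory_bytes
    # B = D / b: number of disk blocks holding the graph (rounded up).
    blocks = (graph_bytes + block_bytes - 1) // block_bytes
    return {
        "parts": parts,
        "disk_seeks": parts ** 2,    # P^2 seeks per iteration
        "block_reads": 2 * blocks,   # the graph is read twice per iteration
    }


# Hypothetical example: an 8 GB graph, 2 GB of memory, 4 KiB blocks.
example = disk_cost_per_iteration(8_000_000_000, 2_000_000_000, 4096)
```

Doubling the available memory in this example halves P and therefore cuts the number of seeks by a factor of four, which illustrates why the runtime depends on the amount of main memory.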

Dataset             # Users      # Products   # Total nodes   # Edges
Amazon.FineFoods    256,059      74,258       330,317         568,454
Amazon.Movies       889,176      253,059      1,142,235       7,911,684
Synthetic.C         2,000,000    8,000,000    10,000,000      100,000,000

Table 2: Datasets.


5. Experiments 


5.1. Experimental setting 

We implemented ORFEL using Java 1.7 over the GraphChi platform, as stated in
Section 2.2. We ran our experiments on an i7-4770 machine with 16 GB of RAM and a
2 TB 7200 RPM HDD; for the tests with SSD, we used a 240 GB drive with I/O at
450 MB/s. For full reproducibility, the complete experimental setup is publicly
available at www.icmc.usp.br/pessoas/junio/DRFEL/index.htm, including source
code and graph/lockstep generators.

We studied two real-world graphs: Amazon.FineFoods and Amazon.Movies. They
are publicly available at the Snap project [20] web page at snap.stanford.edu/.
Both datasets comprise user-product recommendation data from the Amazon website;
the first one refers to the section of fine food products and the second one contains
reviews of movies. For each review, we have the corresponding timestamp and one
numeric evaluation (score) ranging from 1 to 5. Synthetic graphs were also studied
to broaden the scope of our tests. To generate the data, we used a bipartite graph
generator based on the Gnmk model available in NetworkX, in which
n stands for the number of nodes in the first bipartite set; m stands for the number
of nodes in the second bipartite set; and k is the number of randomly generated
edges connecting both sets. Table 2 lists the two Amazon datasets and the synthetic
dataset Synthetic.C, which was generated using n = 2,000,000, m = 8,000,000
and k = 100,000,000. Additionally, we generated benchmark datasets that are
larger versions of dataset Synthetic.C; they were used to study the scalability of our
algorithm, as described in Section 5.4.
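
For readers unfamiliar with the Gnmk model, the sketch below samples k distinct edges uniformly at random between two node sets of sizes n and m. It is a standard-library stand-in written for illustration, not the NetworkX generator used in our experiments; the function name `gnmk_bipartite_edges` and the node numbering are assumptions.

```python
import random
from typing import List, Tuple


def gnmk_bipartite_edges(n: int, m: int, k: int,
                         seed: int = 0) -> List[Tuple[int, int]]:
    """Sample k distinct edges between n 'user' and m 'product' nodes.

    Users are numbered 0..n-1 and products n..n+m-1.
    """
    if k > n * m:
        raise ValueError("k exceeds the number of possible edges")
    rng = random.Random(seed)
    # Each integer in [0, n*m) encodes exactly one (user, product) pair,
    # so sampling without replacement yields k distinct edges.
    codes = rng.sample(range(n * m), k)
    return [(code // m, n + code % m) for code in codes]


# Small illustrative instance; Synthetic.C used far larger values.
edges = gnmk_bipartite_edges(n=5, m=3, k=10, seed=1)
```

Sampling integer codes from a range avoids materializing all n * m candidate pairs, which matters at the scale of the datasets in Table 2.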


Experimental goals 

The main feature expected from ORFEL is the ability to detect lockstep attacks, either
those related to defamation or the ones of illegitimate promotion. As mentioned
earlier, this is an NP-hard problem that we approximately solve via an
optimization approach. Considering these aspects, we verify: the correctness of our
algorithm in Section 5.2; its efficacy (i.e., the ability to find the majority, > 95%, of the
lockstep attacks) in Section 5.3; and its efficiency (i.e., the ability to reach efficacy
within desired time constraints) in Section 5.4.


5.2. Preliminary tests under controlled conditions

In the first experiment, we used 4 small (thousand-edge scale) synthetic graphs to
verify if the algorithm detects locksteps only when they really exist. These are the
controlled conditions of our experimentation, that is, we wanted to make sure that the
algorithm would not point out suspect behaviors when we had ascertained that there
were none to be detected. We generated synthetic graphs in which no product or set
of products would constitute a given suspect behavior, such as one involving 10 products
and 25 users. We then ran ORFEL with varying parameters, including m = 5, n = 10 and
m = 10, n = 25, and verified that, as expected, no lockstep was detected. The same
results could be inferred from the algorithm description in Section 4.2 and also from the
discussion in Section 4.4; still, we verified this feature empirically.

Figure 2: Experiments of efficacy: the percentage of attacks caught versus the size of the
artificially generated attacks (x-axis: Artificial Attack Size (Users,Products)). Parameters
are described as (n,m,p,nSeeds). (a) Amazon.FineFoods - Parameters (10,5,0.8,1000);
(b) Amazon.Movies - Parameters (50,25,0.8,1000); (c) Synthetic.C - Parameters (50,25,0.8,3000).


5.3. Efficacy 

We define efficacy as the ability to identify the majority (> 95%) of the locksteps. 
To test this feature, we created controlled conditions with artificial attacks appended 
to our datasets that allowed us to evaluate the output of the algorithm. Algorithm 7
shows how we generated such attacks, by randomly choosing a group of products and
users and then connecting them within a limited Δt time window. This was necessary
because, since the problem is NP-hard, we would not be able to know whether or 
not the output of the algorithm is correct considering uncontrolled conditions. This 
problem is a variation of subspace clustering, considering the semantic that the clusters 
(locksteps) are unusual and, therefore, suspect. Note that we did not focus on the issue
of determining whether or not a given suspect lockstep is actually an attack, since this
is a distinct problem that demands extra information (i.e., identification, customer
profile, and so on) to be evaluated by means of false-positive and true-positive rates.


Algorithm 7 Lockstep Generator

procedure LOCKSTEPPER(Graph G, nUsers, nPages, Δt)
    users = GetRandomUsers(G, nUsers);
    pages = GetRandomPages(G, nPages);
    for each Page P ∈ pages do
        timestamp = getRandomTimeStamp();
        rating = getRandomRating();
        for each User U ∈ users do
            newTimeStamp = timestamp + getRandomVariation(Δt);
            addEdge(G, U, P, newTimeStamp, rating);
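
Algorithm 7 can be rendered as a short Python sketch. This is not the generator distributed with ORFEL: the edge-list representation, the function signature, the uniform choice of the variation in [0, Δt], and the parameters `t_min` and `t_max` bounding the base timestamp are all assumptions made for illustration. Ratings are drawn from 1 to 5, matching the score range of the Amazon datasets.

```python
import random
from typing import List, Tuple

Edge = Tuple[str, str, float, int]  # (user, page, timestamp, rating)


def lockstepper(edges: List[Edge], users: List[str], pages: List[str],
                n_users: int, n_pages: int, delta_t: float,
                t_min: float, t_max: float, seed: int = 0) -> List[Edge]:
    """Append one artificial lockstep to `edges` and return the new edges."""
    rng = random.Random(seed)
    chosen_users = rng.sample(users, n_users)
    chosen_pages = rng.sample(pages, n_pages)
    attack: List[Edge] = []
    for page in chosen_pages:
        timestamp = rng.uniform(t_min, t_max)  # base time for this page
        rating = rng.randint(1, 5)             # one score shared by the group
        for user in chosen_users:
            # Every chosen user hits the page within delta_t of the base time.
            new_timestamp = timestamp + rng.uniform(0, delta_t)
            attack.append((user, page, new_timestamp, rating))
    edges.extend(attack)
    return attack
```

Because the rating is drawn once per page, all users of the group give the same score to a page, which models both defamation (low scores) and illegitimate promotion (high scores).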


Attack size 

In order to analyze the ability of the algorithm to find locksteps of different sizes, we
ran experiments for each of the three datasets described in Table 2. We fixed m and n
in each case while varying the sizes of the artificial attacks appended to the dataset, so
as to see how effective the algorithm is, depending on the size of the attacks.

In the first experiment, for each dataset of Table 2, we appended artificial attacks
to the data with sizes varying from 10 users and 5 products to 1,000 users and 500
products. This allowed us to observe the percentage of attacks caught for each
configuration - in Figure 2(a), (n = 10, m = 5, p = 0.8, nSeeds = 1000); in Figure 2(b),
(n = 50, m = 25, p = 0.8, nSeeds = 1000); and, in Figure 2(c), (n = 50, m = 25, p =
0.8, nSeeds = 3000). One can see a similar behavior in all plots; that is, the percentage
of attacks caught tends to grow as their sizes become larger than the size described
by the input parameters given to the algorithm. Intuitively, the larger the attacks are,
the more likely it is that they will be detected using a given configuration. Concomitantly,
the smaller the attacks, the less harmful they are. Lastly, notice that even the smaller
attacks could be detected with proper parameters - in this experiment, however, we
wanted to demonstrate the general behavior of the algorithm when using specific
parameter settings, and not whether or not smaller attacks could be detected.

This experiment also indicates that ORFEL behaves as expected for this task in
terms of efficacy. That is, if we compare the behavior of ORFEL with that of the
state-of-the-art algorithm CopyCatch - see Figure 6b in Beutel et al. - one can see that
both approaches present a very similar curve for spotting artificial attacks according to
the attack size and the algorithm parameterization. The results of Beutel et al. support
the same intuition regarding which attacks are easier to detect and how important
parameter tuning is; therefore, they corroborate our conclusions with regard to ORFEL's
efficacy.

Number of seeds 

We also used our three datasets to study the behavior of ORFEL regarding the number
of seeds that it uses. We ran each experiment 4 times in each dataset - as the algorithm
is non-deterministic - and report the average response. It was a requirement that none
of the 4 runs would present discrepant results, and we were able to verify this desirable
property since the variance of the results was on the order of 1%. We introduced 20
synthetic attacks (10 defamations and 10 illegitimate promotions) in each dataset and
varied the number of seeds from 1,000 to 7,000. Figure 3 reports the results obtained
in these experiments. For dataset Amazon.FineFoods - the smallest one, with roughly
570 K edges - we were able to catch over 95% of the attacks with 4,000 seeds. Interestingly,
the average of attacks caught with 5,000 seeds was significantly lower, indicating that
the algorithm reached its peak performance with nearly 4,000 seeds and only had some
variation afterward due to its non-determinism. Figure 3 also reports that the algorithm
caught over 95% of the attacks with 6,000 seeds in dataset Amazon.Movies (8 M
edges), and that it caught over 95% of the attacks with 7,000 seeds in dataset Synthetic.C
(100 M edges). These results indicate that the best number of seeds follows a linear
correlation with the logarithm of the data size, being approximately
$10^{3} \cdot \log_{10}(\text{number of edges})$. For our 3 datasets, it is ≈ 5,800,
≈ 6,900 and ≈ 8,000, respectively.
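
The rule of thumb above is simple to apply when configuring nSeeds for a new dataset. The helper below is a minimal sketch; the name `estimate_seeds` is hypothetical, and the base-10 logarithm is an inference from the fact that it reproduces the three values quoted for our datasets.

```python
import math


def estimate_seeds(num_edges: int) -> int:
    """Rule of thumb from the text: about 10^3 * log10(number of edges)."""
    if num_edges < 1:
        raise ValueError("num_edges must be positive")
    return round(1000 * math.log10(num_edges))


# Edge counts taken from Table 2.
for name, num_edges in [("Amazon.FineFoods", 568_454),
                        ("Amazon.Movies", 7_911_684),
                        ("Synthetic.C", 100_000_000)]:
    print(name, estimate_seeds(num_edges))
```

The estimate is an upper guide rather than a requirement: in the experiments, over 95% of the attacks were already caught with somewhat fewer seeds than the formula suggests.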



Figure 3: Experiments of efficacy: the percentage of attacks caught versus the number
of seeds (x-axis: Number of Seeds). Efficacy is demonstrated when over 95% of the attacks
are caught. Parameters [n,m,p,AttackSize(Users,Products)] are: Synthetic.C
[50,25,0.8,(750,375)]; Amazon.Movies [50,25,0.8,(500,250)]; Amazon.FineFoods
[10,5,0.8,(50,25)].

From this experiment we verified that ORFEL is effective; it identified more than
95% of the attacks in three datasets of different sizes. Also, notice that the algorithm
accurately detected attacks of distinct sizes, as can be seen in the parameters of
Figure 3. This means that ORFEL can be adjusted to the peculiarities of distinct domains.

Experiments on real data 

We performed additional experiments using our real datasets Amazon.Movies and
Amazon.FineFoods, this time without including any synthetic data. In this context,
we considered that a suspect behavior would be 20 users positively recommending 6
movies or food products in less than a week, which, in this semantic context, is an
intense load of recommendations. We ran the algorithm and found 37 suspect locksteps;
it took 8 minutes to achieve convergence for the largest dataset, Amazon.Movies. Since
the execution time was quite small, we also tested the algorithm using variations of the
initial attack description, with 15 users and 7 movies within a week, and also 10 users
and 10 movies within three days. After manually analyzing the suspect locksteps, we
discovered that they were caused by Amazon’s policy of using different identification
numbers for different flavors/sizes of the same food product, and for different versions
of the same movie, while merging their reviews. Although the locksteps found were
not actual attacks, this simple experiment revealed a behavior that deserves further
analysis, since this policy could eventually lead customers to
misleading choices. In summary, we were able to identify, in a universe of 1,140,000
nodes and 8,000,000 edges, tiny temporal patterns that demand close attention to be
noticed. We also emphasize the reduced time required to obtain such results, which
allowed us to study different parameter settings according to the data semantics.

5.4. Efficiency and scalability 

Due to the current scale of network-like data, our method must be efficient. That is,
it must handle billion-scale graphs in reasonable time. We tested this requirement
with synthetic benchmark datasets that are larger versions of dataset Synthetic.C.
Although there is plenty of real data related to our problem (network data, including
edge weights and time stamps), such data is rarely shared by companies due to privacy
matters.


Preprocessing 

Asynchronous Parallel Processing platforms like those reviewed in Section 2.2 demand
a preprocessing step in which the data is organized and formatted in accordance with
the platform’s paradigm. In our case, this step converts text to binary data, then it
sorts and writes the vertices in order, so that they can be read from disk with sequential
scans, minimizing the number of seeks. We take nearly 45 minutes, wall-clock time,
to preprocess 1 billion edges on a mechanical disk, and nearly 15 minutes on a solid-state
disk - for a given dataset, preprocessing is necessary only once, no matter how many
times we process the data later on.


Number of edges 

We tested the time scalability of our algorithm regarding the number of edges of the
input graph. In the first experiment, we ran ORFEL with 100 seeds; we took 7 runtime
measurements with the number of edges varying from 50 million to 1 billion, each of
which was obtained as the average of 3 individual runs. Figure 4 reports
the results; clearly, the runtime scales linearly with regard to
the number of edges. For this configuration, ORFEL took 143 min. (≈2.38 hours) to
process 1 billion edges stored on a mechanical disk, and 78 min. (≈1.3 hours) using a
solid-state disk.

We argue that this performance is very efficient because the previous work (see
Figure 4a in Beutel et al.) took ≈0.5 hour to do a similar processing with
one thousand machines over MapReduce, while we used one single commodity
machine. Our gain in performance is considerable because the former work executes
a sequential (non-parallel) algorithm to compute one seed at a time in each machine;
therefore, performance comes at the cost of using thousands of machines, each
one executing an instance of the computation, in a distributed environment that has
heavy communication demands. In contrast, our algorithm exploits the fact that
the problem can be solved considering only the neighborhood of each node, thus
allowing us to process the graph in a parallel asynchronous mode with multiple seeds
being processed simultaneously. Note that our gains in performance were not only
remarkable - they also made it possible to tackle the problem using commodity hardware.



Figure 4: Experiments of scalability on the number of edges; linear growth of computation
time (100 seeds) versus the number of edges (x-axis: Number of Edges, in millions) for
mechanical disk (HDD) and for solid-state disk (SSD). Parameters
[n,m,p,AttackSize(Users,Products),nSeeds] are: [50,25,0.8,(500,250),100].

Number of seeds 

We also studied the runtime of ORFEL with distinct numbers of seeds. In this
experiment, we used a graph with 100 million edges, varying the number of seeds from 100
to 5,000. Figure 5 reports the results; as can be seen, our proposed method scaled
linearly in time with regard to the number of seeds. The algorithm took 10 min. to
process the data using 100 seeds, while it took 298 min. with 5,000 seeds; that is a
30-times increase in runtime for a 50-times increase in the problem input.



Figure 5: Experiments of scalability on the number of seeds: linear growth (line coefficient < 
1) of computation time versus the number of seeds for mechanical disk (HDD) over 100 million 
edges. Parameters [n,m,p,AttackSize(Users,Products)] are: [50,25,0.8,(500,250)]. 


6. Conclusions and future work 

6.1. Conclusions 

We conclude that, although the problem of detecting fake online interaction over
time is NP-hard, it is possible to detect most of the malicious activities in a timely
manner even with low-cost computer machinery. To do so, we designed a vertex-centric graph
algorithm using the asynchronous parallel processing paradigm. The general problem
is modeled as a bipartite weighted and timestamped graph from which we want to
detect temporal near-bipartite cores. This model suits problems of systematic attacks
aimed at defaming or promoting online entities in high-impact applications, e.g.,
user-product recommendations, user-app evaluations and journal-journal co-citations,
which may be performed by means of fake users, malware credential stealing, Web
robots, and/or social engineering. To validate our proposal, we studied two real graphs
of e-commerce, besides synthetic data.

Finally, we emphasize the importance of detecting lockstep behavior, either for 
defamation or promotion, because these frauds may harm both customers and vendors 
by inducing sales of unverified products. We also note that this problem gets even more 
relevant as the Web 2.0 expands, in which the habits of the users are heavily influenced 
by online trading and recommendation. 

6.2. Future work 

First, an interesting direction would be to further analyze the boundaries of the
potential damage an attacker could inflict without being detected, given by the compromise
between the size of the attack and the number of attacks, which, as stated in Section 3.3,
is an open problem. Also, as we mentioned before, ORFEL suits other problems
that can be represented as graphs, lending support to additional applications as
detailed next.

6.2.1. Social networks 

In social networks, the usual interaction is to like a given post, such as in Google+ or in
Facebook; for this configuration, locksteps characterize solely illegitimate promotion,
in which a given post (or page) gets fake likes from attackers willing to make it more
relevant than it really is. According to the model of ORFEL, this problem refers to an
unweighted bipartite graph, i.e., all edges have the same weight.

6.2.2. Journal co-citations 

Given the pressure for relevance and impact, some scientific journals may use a
co-citation scheme in which one journal cites the other and vice-versa, just like in the case
spotted by Nature in 2013 [26]. According to this scheme, which is one variant of the
lockstep behavior, a journal tends to favor papers that cite a specific journal; editors
may even recommend to authors what to cite in their work as a condition for publication.

Identifying this kind of lockstep behavior is not a trivial task because systematic
co-citation tends to “disappear” along years of publications, given that such schemes
are usually masked by the volume of legitimate citations and by the magnitude of time.
For example, it is reasonable to have co-citation between any two journals in a period
of 10 years. The problem becomes even harder if more than two journals - e.g., three or
four journals - set up the scheme. In this case, a simple journal-to-journal interaction
may not be sufficient to detect the scheme. The temporal factor and the volume of data
make it a problem much harder than simply detecting bipartite subgraphs.

This problem is another instance of the lockstep detection problem studied in our
work. With ORFEL, it is possible to spot co-citation occurring, let us say, within
periods of 1 or 2 years for any number of journals. For time intervals such as those,
a set of journals that cite each other with high intensity may be considered suspect.

In this specific case, our problem formulation changes a little. Our model assumes
users recommending products - a bipartite graph; for detecting journal co-citations
we must replicate the set of journals under investigation. That is, each journal must
be represented twice in the model, once as a citing journal and once as
a cited journal, thus defining a bipartite graph as expected by our algorithm. The
output of ORFEL, then, shall present bipartite subgraphs. However, distinctly from
the user-product model, it is not enough to identify bipartite subgraphs as an indication
of fraud; we must also have a high similarity between the two sets of nodes in each
subgraph. This similarity can be directly evaluated using the Jaccard set similarity:
$\mathrm{Jaccard} = |set_1 \cap set_2| \,/\, |set_1 \cup set_2|$, which returns 1 if the two
sets are exactly the same, and 0 if they have no intersection. For the co-citation problem,
our algorithm could be configured to return the set of bipartite subgraphs ordered by
their Jaccard similarity. Of course, ORFEL spots behaviors that are solely suspicious - not
definitive frauds; they must go through human interpretation for a definitive decision,
considering, for example, that it is expected that the journals with very high impact rates
cite each other, while the same behavior is not expected for journals with lower impact
rates.
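
The adaptation described above can be sketched in a few lines of Python. The function names `to_bipartite`, `jaccard` and `rank_by_similarity`, the "citing:"/"cited:" prefixes, and the representation of a subgraph as a pair of journal-name sets are all assumptions made for illustration; ORFEL itself does not currently implement this post-processing.

```python
from typing import Iterable, List, Set, Tuple


def to_bipartite(citations: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Represent each journal twice: once as citing, once as cited."""
    return [(f"citing:{src}", f"cited:{dst}") for src, dst in citations]


def jaccard(set_1: Set[str], set_2: Set[str]) -> float:
    """|set_1 & set_2| / |set_1 | set_2|; taken as 0.0 for two empty sets."""
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0.0


def rank_by_similarity(subgraphs: Iterable[Tuple[Set[str], Set[str]]]
                       ) -> List[Tuple[Set[str], Set[str]]]:
    """Order (citing journals, cited journals) pairs by decreasing Jaccard."""
    return sorted(subgraphs, key=lambda pair: jaccard(*pair), reverse=True)
```

A subgraph whose citing and cited sets coincide scores 1 and is ranked first, which matches the intuition that a closed group of journals citing one another is the most suspicious pattern.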

Note that our algorithm can not only detect suspicious co-citation cases - it can do
so very efficiently. Since ORFEL is fast and scalable, it can inspect virtually all the
publication interaction ever produced in just a few hours.